Universidade do Minho: RepositoriUM

Classification tools for carotenoid content estimation in Manihot esculenta via metabolomics and machine learning

Author: A Champagne
AE Hoerl
AJ Meléndez-Martínez
AL Chávez
C Costa
D Rodriguez-Amaya
H Zou
K Kljak
MR La Frano
SA Tanumihardjo
T Sánchez
V Svetnik
W Stahl
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 21/06/2017

Cassava genotypes (Manihot esculenta Crantz) with high pro-vitamin A activity have been identified as a strategy to reduce the prevalence of deficiency of this vitamin. The color variability of cassava roots, which can vary from white to red, is related to the presence of several carotenoid pigments. The present study has shown how CIELAB color measurement on cassava root tissue can be used as a non-destructive and very fast technique to quantify the levels of carotenoids in cassava root samples, avoiding the use of more expensive analytical techniques for compound quantification, such as UV-visible spectrophotometry and HPLC. For this, we used machine learning techniques, associating the colorimetric data (CIELAB) with the data obtained by UV-vis and HPLC, to obtain carotenoid prediction models for this type of biomass. The best values of R2 (above 90%) were observed for the predictive variable TCC determined by UV-vis spectrophotometry. When we tested the machine learning models using the CIELAB values as inputs, for the total carotenoid contents quantified by HPLC, the Partial Least Squares (PLS), Support Vector Machines, and Elastic Net models presented the best values of R2 (above 40%) and Root-Mean-Square Error (RMSE). For the carotenoid quantification by UV-vis spectrophotometry, R2 (around 60%) and RMSE values (around 6.5) are more satisfactory. Ridge regression and Elastic Net showed the best results. It can be concluded that the use of the colorimetric technique (CIELAB) associated with UV-vis/HPLC and statistical techniques of prognostic analysis through machine learning can predict the content of total carotenoids in these samples, with good precision and accuracy.
Funding: CAPES - Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (407323/2013-9).
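The abstract above compares models by R2 and RMSE. As a minimal sketch of how these two scores are conventionally computed, assuming the usual definitions (the abstract does not state which variant of R2 was used), the following uses made-up numbers rather than data from the study:

```python
import math


def rmse(observed, predicted):
    """Root-mean-square error between observed and predicted values."""
    squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared_errors) / len(observed))


def r_squared(observed, predicted):
    """Coefficient of determination, computed as 1 - SS_res / SS_tot."""
    mean_observed = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_observed) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Hypothetical carotenoid contents (illustrative values only, not study data).
observed = [2.0, 4.0, 6.0, 8.0]
predicted = [2.5, 3.5, 6.5, 7.5]

print("RMSE:", rmse(observed, predicted))
print("R2 as a percentage:", 100 * r_squared(observed, predicted))
```

RMSE is expressed in the units of the measured quantity, so it is only comparable between models fitted to the same response variable, whereas R2 is unitless.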

Random forest for gene selection and microarray data classification

Author: A.A. Alizadeh
D. Singh
D.T. Ross
J. Khan
J.W. Lee
L. Breiman
L.J. van’t Veer
S. Ramaswamy
S.L. Pomeroy
T. Li
T.R. Golub
U. Alon
V. Svetnik
Y.L. Chin
Publication venue: Biomedical Informatics
Publication date: 01/01/2011

A random forest method has been selected to perform both gene selection and classification of the microarray data. In this embedded method, the selection of the smallest possible sets of genes with the lowest error rates is the key factor in achieving the highest classification accuracy. Hence, an improved gene selection method using random forest has been proposed to obtain the smallest subset of genes as well as the biggest subset of genes prior to classification. The option for biggest subset selection is provided to assist researchers who intend to use the informative genes for further research. Enhanced random forest gene selection has performed better in terms of selecting the smallest subset as well as the biggest subset of informative genes with the lowest out-of-bag error rates through gene selection. Furthermore, the classification performed on the selected subset of genes using random forest has led to lower prediction error rates compared to the existing method and other similar available methods.

Universiti Teknologi Malaysia Institutional Repository

Predicting postoperative complications for gastric cancer patients using data mining

Author: A Biglarian
A Morais
A Rajput
C Zhang
D Delen
DM Roder
F Fonseca
H Brenner
HC Koh
I Witten
J Shim
M Khalilia
M Rugge
P Beeler
P Karimi
R Sitarz
S Tuffery
T Mitchell
V Svetnik
X Wu
Y Zhao
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2019

Gastric cancer refers to the development of malignant cells that can grow in any part of the stomach. With the vast amount of data being collected daily in healthcare environments, it is possible to develop new algorithms which can support the decision-making processes in the treatment of gastric cancer patients. This paper aims to predict, using the CRISP-DM methodology, the outcome of the hospitalization of gastric cancer patients who have undergone surgery, as well as the occurrence of postoperative complications during surgery. The study showed that, on one hand, the RF and NB algorithms are the best at detecting the outcome of hospitalization, taking into account patients’ clinical data. On the other hand, the algorithms J48, RF, and NB offer better results in predicting postoperative complications.
Funding: FCT - Fundação para a Ciência e a Tecnologia (UID/CEC/00319/2013)

Universidade do Minho: RepositoriUM

Predicting Phospholipidosis Using Machine Learning

Author: Baldi P.
Bender A.
Breiman L.
Burbidge R.
Cannon E. O.
Cortes C.
Eichberg J.
Glen R. C.
Guyon I.
Hall M.
Halliwell W. H.
Ivanciuc O.
John B. O. Mitchell
Kruhlak N. L.
Lüllmann H.
Matthews B. W.
Nelson A. A.
Nioi P.
Pappu A.
Pelletier D. J.
Perez J. J.
Ploemen J.-P. H. T. M.
Reasor M. J.
Robert C. Glen
Robert Lowe
Rücker C.
Sawada H.
Svetnik V.
Tetko I. V.
Tomizawa K.
Weininger D.
Publication venue: American Chemical Society
Publication date: 01/01/2010

Phospholipidosis is an adverse effect caused by numerous cationic amphiphilic drugs and can affect many cell types. It is characterized by the excess accumulation of phospholipids and is most reliably identified by electron microscopy of cells revealing the presence of lamellar inclusion bodies. The development of phospholipidosis can cause a delay in the drug development process, and the importance of computational approaches to the problem has been well documented. Previous work on predictive methods for phospholipidosis showed that state-of-the-art machine learning methods produced the best results. Here we extend this work by looking at a larger data set mined from the literature. We find that circular fingerprints lead to better models than either E-Dragon descriptors or a combination of the two. We also observe very similar performance in general between Random Forest and Support Vector Machine models.

University of St. Andrews - Pure

An Introspective Comparison of Random Forest-Based Classifiers for the Analysis of Cluster-Correlated Data by Way of RF++

Author: A Vlahou
Alan R. Dabney
Anthony P. Leclerc
AR Dabney
B Efron
B Rosner
B Wu
BL Adam
C Strobl
D Agranoff
DS Palmer
EF Petricoin
EJ Finehout
Elizabeth G. Hill
ET Fung
Fabio Rapallo
G Izmirlian
GA Churchill
H Zhang
JM Koomen
Jonas S. Almeida
JR Quinlan
JS Morris
L Breiman
L Li
LE Breiman
M Hilario
MR Segal
PJ Adam
RW Garden
S Schaub
SK Lee
TM Pawlik
TP Conrads
V Svetnik
Y Yasui
YD Chen
Yuliya V. Karpievitch
YV Karpievitch
Publication venue: Public Library of Science
Publication date: 01/01/2009

Many mass spectrometry-based studies, as well as other biological experiments, produce cluster-correlated data. Failure to account for correlation among observations may result in a classification algorithm overfitting the training data and producing overoptimistic estimated error rates, and may make subsequent classifications unreliable. Current common practice for dealing with replicated data is to average each subject replicate sample set, reducing the dataset size and incurring loss of information. In this manuscript we compare three approaches to dealing with cluster-correlated data: unmodified Breiman's Random Forest (URF), forest grown using subject-level averages (SLA), and RF++ with subject-level bootstrapping (SLB). RF++, a novel Random Forest-based algorithm implemented in C++, handles cluster-correlated data through a modification of the original resampling algorithm and accommodates subject-level classification. Subject-level bootstrapping is an alternative sampling method that obviates the need to average or otherwise reduce each set of replicates to a single independent sample. Our experiments show nearly identical median classification and variable selection accuracy for SLB forests and URF forests when applied to both simulated and real datasets. However, the run-time estimated error rate was severely underestimated for URF forests. Predictably, SLA forests were found to be more severely affected by the reduction in sample size, which led to poorer classification and variable selection accuracy. Perhaps most importantly, our results suggest that it is reasonable to utilize URF for the analysis of cluster-correlated data. Two caveats should be noted: first, correct classification error rates must be obtained using a separate test dataset, and second, an additional post-processing step is required to obtain subject-level classifications. RF++ is shown to be an effective alternative for classifying both clustered and non-clustered data.
Source code and stand-alone compiled versions of command-line and easy-to-use graphical user interface (GUI) versions of RF++ for Windows and Linux, as well as a user manual (Supplementary File S2), are available for download at: http://sourceforge.org/projects/rfpp/ under the GNU public license.
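The idea of subject-level bootstrapping described in the abstract above can be sketched as follows. This is an illustrative Python reading of the abstract, not the RF++ C++ code; the function name and the choice to draw as many subjects as there are distinct subjects are assumptions made here.

```python
import random


def subject_level_bootstrap(subject_ids, rng):
    """Resample whole subjects with replacement, keeping their replicates together.

    subject_ids: one subject label per sample (replicate).
    Returns (in_bag, out_of_bag) lists of sample indices.
    """
    # Group sample indices by the subject they belong to.
    indices_by_subject = {}
    for index, subject in enumerate(subject_ids):
        indices_by_subject.setdefault(subject, []).append(index)

    subjects = sorted(indices_by_subject)
    # Assumption: draw as many subjects as there are distinct subjects.
    drawn = [rng.choice(subjects) for _ in subjects]

    # Every replicate of a drawn subject enters the bag (repeats included).
    in_bag = [i for subject in drawn for i in indices_by_subject[subject]]
    # Subjects never drawn are out-of-bag with all of their replicates.
    drawn_set = set(drawn)
    out_of_bag = [
        i
        for subject in subjects
        if subject not in drawn_set
        for i in indices_by_subject[subject]
    ]
    return in_bag, out_of_bag


# Hypothetical data: three subjects with replicated measurements.
subject_ids = ["s1", "s1", "s1", "s2", "s2", "s3", "s3", "s3"]
in_bag, out_of_bag = subject_level_bootstrap(subject_ids, random.Random(0))
print("in-bag sample indices:", in_bag)
print("out-of-bag sample indices:", out_of_bag)
```

Because replicates of one subject never straddle the in-bag and out-of-bag sets, an out-of-bag error estimate built this way is not flattered by correlated replicates, which is the failure the abstract reports for the unmodified forest.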

Public Library of Science (PLOS)

Texas A&M Repository

Conditional variable importance for random forests

Author: A Bureau
Achim Zeileis
Anne-Laure Boulesteix
BJ van Os
C Strobl
Carolin Strobl
E Bauer
JH Silber
K Nicodemus
KJ Archer
KL Lunetta
L Breiman
M Nason
MR Segal
M van der Laan
P Bühlmann
P Good
R Development Core Team
R Diaz-Uriarte
R Feraud
SM Stigler
T Hastie
T Hothorn
TG Dietterich
Thomas Augustin
Thomas Kneib
V Svetnik
W Rodenburg
X Huang
X Xia
Y Lin
Y Qi
Publication venue: BioMed Central
Publication date: 01/01/2008

Random forests are becoming increasingly popular in many scientific fields because they can cope with "small n large p" problems, complex interactions and even highly correlated predictor variables. Their variable importance measures have recently been suggested as screening tools for, e.g., gene expression studies. However, these variable importance measures show a bias towards correlated predictor variables. We identify two mechanisms responsible for this finding: (i) a preference for the selection of correlated predictors in the tree building process and (ii) an additional advantage for correlated predictor variables induced by the unconditional permutation scheme that is employed in the computation of the variable importance measure. Based on these considerations we develop a new, conditional permutation scheme for the computation of the variable importance measure. The resulting conditional variable importance is shown to reflect the true impact of each predictor variable more reliably than the original marginal approach.
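The contrast between unconditional and conditional permutation in the abstract above can be illustrated with a small sketch. It assumes that conditioning means shuffling a predictor only within strata of the other predictors; how those strata are formed is not stated in the abstract, so the `groups` argument below is a hypothetical stand-in, and `predict` is any fitted model supplied by the caller rather than a random forest.

```python
import random


def permute_within_groups(values, groups, rng):
    """Shuffle values only among positions that share the same group label."""
    positions_by_group = {}
    for position, group in enumerate(groups):
        positions_by_group.setdefault(group, []).append(position)
    permuted = list(values)
    for positions in positions_by_group.values():
        shuffled = [values[p] for p in positions]
        rng.shuffle(shuffled)
        for position, value in zip(positions, shuffled):
            permuted[position] = value
    return permuted


def mean_squared_error(predict, rows, targets):
    """Average squared difference between predictions and targets."""
    return sum((predict(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)


def permutation_importance(predict, rows, targets, column, rng, groups=None):
    """Error increase after permuting one column.

    groups=None gives the unconditional (marginal) scheme; passing group
    labels restricts the shuffle to rows within the same stratum.
    """
    if groups is None:
        groups = [0] * len(rows)  # a single stratum: plain permutation
    original = [row[column] for row in rows]
    shuffled = permute_within_groups(original, groups, rng)
    permuted_rows = [
        row[:column] + [value] + row[column + 1:]
        for row, value in zip(rows, shuffled)
    ]
    baseline = mean_squared_error(predict, rows, targets)
    return mean_squared_error(predict, permuted_rows, targets) - baseline
```

With a single stratum the shuffle breaks the link between the chosen predictor and both the response and the other predictors. Shuffling within strata keeps the predictor's association with the stratifying variables roughly intact, so only its remaining contribution is measured.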

CiteSeerX

Elektronische Publikationen der Wirtschaftsuniversität Wien

Open Access LMU

Bias in random forest variable importance measures: Illustrations, sources and a solution

Author: A Bureau
A Dobra
A Liaw
Achim Zeileis
AG Heidema
AL Boulesteix
Anne-Laure Boulesteix
C Furlanello
C Strobl
Carolin Strobl
DN Politis
EC Gunther
H Kim
I Kononenko
J Friedman
K Arun
KL Lunetta
L Breiman
M van der Laan
MM Ward
MP Cummings
MR Segal
P Bühlmann
PJ Bickel
R Development Core Team
R Díaz-Uriarte
R Guha
T Hothorn
TM Therneau
Torsten Hothorn
V Svetnik
X Huang
Y Qi
Y Shih
Publication venue: BioMed Central
Publication date: 01/01/2006

BACKGROUND: Variable importance measures for random forests have been receiving increased attention as a means of variable selection in many classification tasks in bioinformatics and related scientific fields, for instance to select a subset of genetic markers relevant for the prediction of a certain disease. We show that random forest variable importance measures are a sensible means for variable selection in many applications, but are not reliable in situations where potential predictor variables vary in their scale of measurement or their number of categories. This is particularly important in genomics and computational biology, where predictors often include variables of different types, for example when predictors include both sequence data and continuous variables such as folding energy, or when amino acid sequence data show different numbers of categories. RESULTS: Simulation studies are presented illustrating that, when random forest variable importance measures are used with data of varying types, the results are misleading because suboptimal predictor variables may be artificially preferred in variable selection. The two mechanisms underlying this deficiency are biased variable selection in the individual classification trees used to build the random forest on one hand, and effects induced by bootstrap sampling with replacement on the other hand. CONCLUSION: We propose to employ an alternative implementation of random forests, that provides unbiased variable selection in the individual classification trees. When this method is applied using subsampling without replacement, the resulting variable importance measures can be used reliably for variable selection even in situations where the potential predictor variables vary in their scale of measurement or their number of categories. 
The usage of both random forest algorithms and their variable importance measures in the R system for statistical computing is illustrated and documented thoroughly in an application re-analyzing data from a study on RNA editing. Therefore, the suggested method can be applied straightforwardly by scientists in bioinformatics research.

Elektronische Publikationen der Wirtschaftsuniversität Wien

Open Access LMU

Improved Bevirimat resistance prediction by combination of structural and sequence-based classifiers

Author: A Karatzoglou
A Kernytsky
A Liaw
A Sali
AG Heidema
B Gabrys
C Wong
D Heider
D Wolpert
Daniel Hoffmann
DK Worthylake
Dominik Heider
F Wilcoxon
H Naderi-Manesh
J Demsar
J Kyte
J Nikolaj Dybowski
J Verheyen
J Zhou
Jens Verheyen
JN Dybowski
K Salzwedel
K van Baelen
KC Chou
KM Ting
L Breiman
L Nanni
LI Kuncheva
M Kierczak
M Pyka
MA Wainberg
Martin Pyka
ML Calle
Mona Riemenschneider
N Beerenwinkel
N Morellet
N Qian
PW Keller
R Development Core Team
RJ Murray
S Draghici
S Džeroski
S Kawashima
Sascha Hauke
SY Rhee
T Fawcett
T Sing
V Svetnik
W Gronwald
Publication venue: BioMed Central
Publication date: 01/01/2011

BACKGROUND: Maturation inhibitors such as Bevirimat are a new class of antiretroviral drugs that hamper the cleavage of HIV-1 proteins into their functionally active forms. They bind to these preproteins and inhibit their cleavage by the HIV-1 protease, resulting in non-functional virus particles. Nevertheless, there exist mutations in this region leading to resistance against Bevirimat. Highly specific and accurate tools to predict resistance to maturation inhibitors can help to identify patients who might benefit from the usage of these new drugs. RESULTS: We tested several methods to improve Bevirimat resistance prediction in HIV-1. It turned out that combining structural and sequence-based information in classifier ensembles led to accurate and reliable predictions. Moreover, we were able to identify the most crucial regions for Bevirimat resistance computationally, which are in line with experimental results from other studies. CONCLUSIONS: Our analysis demonstrated the use of machine learning techniques to predict HIV-1 resistance against maturation inhibitors such as Bevirimat. New maturation inhibitors are already under development and might enlarge the arsenal of antiretroviral drugs in the future. Thus, accurate prediction tools are very useful to enable a personalized therapy.

Machine learning on normalized protein sequences

Author: A Altmann
A Kernytsky
AE Karnoub
AK Patick
B Liu
C Strobl
C Torti
D Heider
D Wang
Daniel Hoffmann
DJ Kempf
Dominik Heider
F Wilcoxon
GC Cawley
GE Forsythe
GM Pao
H Lodhi
I Dubchak
IR Vetter
J Demsar
J Kjaer
J Kyte
J Pánek
Jens Verheyen
JN Dybowski
K Wang
KC Chou
L Breiman
L Nanni
M Borschbach
M Kierczak
M Kozisek
MA Jensen
ME Quinones-Mateu
N Beerenwinkel
N Qian
NS Shulman
O Haq
P Chowriappa
P Mundra
R Colonno
S Boisvert
S Ong
S Sonnenburg
S Xu
SY Rhee
T Fawcett
T Hou
T Sing
TB Thompson
V Svetnik
W Resch
Y Guo
Publication venue: BioMed Central
Publication date: 01/01/2011

BACKGROUND: Machine learning techniques have been widely applied to biological sequences, e.g. to predict drug resistance in HIV-1 from sequences of drug target proteins and protein functional classes. As deletions and insertions are frequent in biological sequences, a major limitation of current methods is the inability to handle varying sequence lengths. FINDINGS: We propose to normalize sequences to uniform length. To this end, we tested one linear and four different non-linear interpolation methods for the normalization of sequence lengths of 19 classification datasets. Classification tasks included prediction of HIV-1 drug resistance from drug target sequences and sequence-based prediction of protein function. We applied random forests to the classification of sequences into "positive" and "negative" samples. Statistical tests showed that the linear interpolation outperforms the non-linear interpolation methods in most of the analyzed datasets, while in a few cases non-linear methods had a small but significant advantage. Compared to other published methods, our prediction scheme leads to an improvement in prediction accuracy by up to 14%. CONCLUSIONS: We found that machine learning on sequences normalized by simple linear interpolation gave better or at least competitive results compared to state-of-the-art procedures, and thus is a promising alternative to existing methods, especially for protein sequences of variable length.
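Normalizing sequences to a uniform length by linear interpolation, as proposed in the abstract above, might look like the following sketch. It assumes each sequence has already been encoded as one number per residue; the abstract does not say which encoding was used, so the descriptor values and the function name here are hypothetical.

```python
def normalize_length(values, target_length):
    """Resample a numeric sequence to target_length points by linear interpolation."""
    if target_length < 1 or not values:
        raise ValueError("need a non-empty sequence and a positive target length")
    if len(values) == 1 or target_length == 1:
        # Nothing to interpolate between: repeat the first value.
        return [float(values[0])] * target_length

    # Spread target_length points evenly over the original index range.
    step = (len(values) - 1) / (target_length - 1)
    resampled = []
    for i in range(target_length):
        position = i * step
        left = min(int(position), len(values) - 2)  # keep left + 1 in range
        fraction = position - left
        resampled.append(values[left] * (1 - fraction) + values[left + 1] * fraction)
    return resampled


# Hypothetical per-residue descriptor values for two sequences of unequal length.
short_sequence = [1.8, -3.5, 2.5]
long_sequence = [1.8, -3.5, 2.5, -0.4, 4.2, -4.5]

# Both become fixed-length feature vectors that a classifier can consume.
print(normalize_length(short_sequence, 5))
print(normalize_length(long_sequence, 5))
```

After this step every sequence yields a vector of the same length regardless of insertions or deletions, which is what allows a standard classifier such as a random forest to be trained on them.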